ABSTRACT

Solid waste has become a serious issue for countries around the world, since the amount of
generated solid waste increases annually. As an effort to reduce and reuse solid waste, classification
of solid waste images is needed to support automatic waste sorting. In the image classification task,
image segmentation and feature extraction play important roles. This research applies a recent deep
learning-based segmentation method, namely the pyramid scene parsing network (PSPNet). We also use
various combinations of image feature extraction (color, texture, and shape) to search for the best
combination of features. As a comparison, we also perform experiments without segmentation to see
the effect of PSPNet. Then, a support vector machine (SVM) is applied at the end as the classification
algorithm. Based on the experimental results, it can be concluded that applying segmentation generally
provides a better source for feature extraction, especially for color and shape features, and hence
increases the accuracy of the classifier. It is also observed that the most important feature in this
problem is the color feature. However, the accuracy of the classifier increases if additional features
are introduced. The highest accuracy of 76.49% is achieved when PSPNet segmentation is applied and
all feature types are combined.

Keywords: Feature extraction, PSPNet, Segmentation, SVM, Waste classification


1. INTRODUCTION

Solid waste has become a serious issue for countries around the world, since the amount of
generated solid waste increases annually. In 2016, the total solid waste generated by the world's cities
reached 2.01 billion tonnes, equal to 0.74 kilograms of solid waste per person per day. This
number is estimated to increase by 70%, up to 3.40 billion tonnes of solid waste, in 2050. Population growth
and urbanization are the most significant factors that trigger the increase in the amount of waste. Poor
waste management may create serious problems related to health, safety, and the environment. Therefore, a
proper waste management strategy is needed to minimize such negative impacts [1].

One effort to reduce the amount of solid waste is to improve waste reusability. Waste
sorting plays a significant role in supporting waste reusability [2]. Since the amount of waste is large and
people's awareness of waste sorting is still low, automatic waste sorting is needed. The starting point for
automatic waste sorting is to build a classification model that can recognize the type of waste in an
image.

Some previous studies have investigated the use of machine learning algorithms to classify or
recognize waste images. Mustaffa et al. [3] and Torres-gracia et al. [4] classified waste images into three
classes using conventional machine learning algorithms and achieved good accuracies, but their experiments
used only 20 samples in each class. Therefore, the generalization of the resulting models to the
variety of real waste images could not be assured. Adedeji and Wang [5] and Costa et al. [6] classified waste
images using deep learning models and a larger number of samples, but they used the captured waste images
directly, without any segmentation. However, segmentation is one of the most important parts of
image preprocessing. A poor segmentation result may degrade the performance of subsequent processes, such
as feature extraction and classification [7].

Segmentation is the process of partitioning an image into several disjoint subsets [8], for
example partitioning an image into background and foreground. Segmentation can also be used to extract
the region of interest of an image [7]. State-of-the-art segmentation methods are deep learning algorithms
with special architectures, such as encoder-decoder networks, which perform better than conventional methods
(such as thresholding) [9]. The pyramid scene parsing network (PSPNet) is a deep learning network
that can be used for semantic image segmentation. On some large benchmark datasets, PSPNet outperforms other
deep learning-based segmentation methods, such as the fully convolutional network (FCN), DeepLab, the deep
parsing network (DPN), and Laplacian pyramid reconstruction and refinement (LRR). PSPNet is able to achieve
better segmentation results because it considers the global context of the image and uses a pyramid pooling
module to obtain context from different regions of an image [10].

In addition, the feature extraction of the image must be chosen properly to achieve good classification
results [11], [12]. Feature extraction aims to extract a relevant subset of features from an image and to reduce
the large dimension of the image to a lower-dimensional set of image features [13]. Color, texture, and shape are
the most common visual features extracted from an image. Color moments are one of the simplest color features
compared with others, such as the color histogram, color coherence vector, and color correlogram [14]. Color
moments have also been proven to be effective and efficient for extracting the color features of an image [15].
Texture features are also important for capturing the relationship between neighboring pixels. The gray level
co-occurrence matrix (GLCM) is one of the popular texture-based feature extraction methods and has been
successfully applied in many computer vision problems [16]-[18]. Another important image feature for describing
the object in an image is the shape feature. Some morphological features, such as area, perimeter, major and
minor axis, centroid-x, centroid-y, roundness, rectangularity, eccentricity, and elongation, can be used as
shape descriptors [19]. In addition, Hu moments are also important for extracting shape features. The Hu
moment is a region-based method that uses second- and third-order central moments to construct seven invariant
moments whose values are not affected when the image is translated, rotated, or scaled [20].

In this research, we propose the application of PSPNet as a segmentation step to provide a good source for
feature extraction. We also use various combinations of image feature extraction (color, texture, and shape) to
search for the best combination of features for solid waste image classification. As a comparison, we also
perform experiments without segmentation to see the effect of PSPNet. Then, a support vector machine
(SVM) is applied at the end as a classifier. SVM is a binary classification algorithm proposed by Cortes and
Vapnik [21], which works by finding the optimal hyperplane that maximizes the separation between binary class
data. SVM has been successfully applied to various classification problems and has been shown to perform better
than other popular classification algorithms, such as the artificial neural network (ANN) [22], [23], the Naive
Bayes classifier, and random forest [24].


2. RESEARCH METHODOLOGY 

Figure 1 shows the stages of the process in this research. First, the image dataset is segmented using
PSPNet; the process then continues with feature extraction, classification, and evaluation. As a
comparison, to examine the effect of PSPNet, we also perform experiments without segmentation, in which
the segmentation stage in Figure 1 is skipped.

Figure 1. Research methodology


2.1. Dataset

The public trash image dataset Trashnet is used in this research as the source data for conducting
experiments. The Trashnet dataset was collected by Yang and Thung [25]. This dataset contains 2,527 trash images
of 224x224 pixels, which are grouped into six classes: glass (501), paper (594), cardboard (403), plastic (482),
metal (410), and trash (137). A sample image from each class can be seen in Figure 2 [25].





[Figure 2 panels: glass, paper, cardboard, plastic, metal]

Figure 2. Sample image from each class of the Trashnet dataset [25]


2.2. PSPNet for image segmentation

PSPNet is applied to generate a segmented binary image; the bounding box of the segmented region is then
calculated and the image is cropped so that only the main object remains. PSPNet is a deep learning
network for semantic image segmentation. PSPNet outperforms FCN-based segmentation because it
considers the global context of the image and uses a pyramid pooling module to obtain context from
different regions of an image. The architecture of PSPNet is shown in Figure 3 [10].

Figure 3. The architecture of PSPNet [10]: (a) input image, (b) feature map, (c) pyramid pooling module,
and (d) final prediction


First, an input image (Figure 3(a)) is fed into a convolutional neural network (CNN) with a dilated
network strategy, which generates a feature map (Figure 3(b)) whose size is 1/8 of the original input image.
Then, the feature map is forwarded to the pyramid pooling module (Figure 3(c)), which generates the
concatenated feature map at the end of the module. In the last step (Figure 3(d)), a convolution layer is
applied to the concatenated feature map to generate the final prediction for each pixel in the image. There
are four operations in the pyramid pooling module [10], [26]:

- Sub-region average pooling
Each feature map is pooled over different sub-regions to obtain a different context representation in each
sub-region. In the first level (red), global average pooling is performed on each feature map. The result
is a single bin output for each feature map. In the second level (orange), third level (blue), and fourth level
(green), each feature map is divided into 2x2, 3x3, and 6x6 sub-regions, respectively, and each sub-region
is then pooled by average pooling.

- Convolution
A 1x1 convolution is performed at each level to reduce the dimension (number of channels) of the pooled
feature map to 1/N of the original one (black), where N is the number of pyramid levels.

- Upsampling
Upsampling is performed using bilinear interpolation so that each feature map has the same size as
the original one (black).

- Concatenation
The original feature map (black) and all upsampled feature maps from the first to the fourth level are
concatenated, and the result is forwarded to a convolutional layer for prediction.
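
The sub-region average pooling step can be sketched in plain Python for a single feature map. This is only a minimal illustration under stated assumptions, not the PSPNet implementation: the helper name `adaptive_avg_pool`, the toy 6x6 feature map, and the floor/ceil rule for sub-region boundaries are all chosen for this example, and the convolution, upsampling, and concatenation steps are omitted.

```python
import math

def adaptive_avg_pool(fmap, bins):
    """Average-pool a 2D feature map (list of rows) into a bins x bins grid."""
    h, w = len(fmap), len(fmap[0])
    pooled = []
    for by in range(bins):
        # Sub-region boundaries: floor for the start, ceil for the end (assumed rule).
        y0, y1 = (by * h) // bins, math.ceil((by + 1) * h / bins)
        row = []
        for bx in range(bins):
            x0, x1 = (bx * w) // bins, math.ceil((bx + 1) * w / bins)
            cells = [fmap[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled

# Toy 6x6 feature map holding the values 0..35 (hypothetical data).
fmap = [[float(6 * y + x) for x in range(6)] for y in range(6)]

# The four pyramid levels described above: 1x1, 2x2, 3x3 and 6x6 bins.
pyramid = {bins: adaptive_avg_pool(fmap, bins) for bins in (1, 2, 3, 6)}
```

Here the 1x1 level reduces the whole map to its global average, while the 6x6 level keeps one value per cell of this toy map, which shows how the levels trade spatial detail for global context.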

The process of segmentation using PSPNet consists of training and testing. The dataset is divided into
70% training data, 15% validation data, and 15% testing data. Training is performed using
combinations of hyperparameters: learning rate (0.001, 0.0001, and 0.00001) and batch size (5 and 10), for 50
epochs each. After training with these combinations of hyperparameters, six image segmentation models are
obtained; the testing data is then used to evaluate them and select the best model. The Dice coefficient (DC)
is used to evaluate the segmentation results, as shown in (1), where A and B are the image regions being
compared [27].


$$DC = \frac{2\,|A \cap B|}{|A| + |B|} \qquad (1)$$
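
As a small worked example of (1), the Dice coefficient of two binary masks can be computed as below. This is a sketch under assumptions: the masks are plain lists of 0/1 rows, the function name `dice_coefficient` and the toy masks are hypothetical, and treating two empty masks as a perfect match is a convention chosen here rather than something stated in the text.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice coefficient between two binary masks given as lists of rows of 0/1."""
    a = [v for row in mask_a for v in row]
    b = [v for row in mask_b for v in row]
    intersection = sum(1 for p, q in zip(a, b) if p and q)
    total = sum(1 for p in a if p) + sum(1 for q in b if q)
    # Convention (an assumption): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

predicted = [[0, 1, 1],
             [0, 1, 1],
             [0, 0, 0]]
reference = [[0, 1, 1],
             [0, 1, 0],
             [0, 0, 0]]

# |A| = 4, |B| = 3, |A n B| = 3, so DC = 2*3 / (4+3).
dc = dice_coefficient(predicted, reference)
```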


2.3. Feature extraction 

Feature extraction aims to extract a relevant subset of features from an image [13]. This research
uses three kinds of image features, namely color features extracted using color moments, texture features
extracted using the gray level co-occurrence matrix (GLCM), and shape features. Experiments are run using one
feature type or a combination of them to obtain the best classification result. Table 1 compares the source
image for each feature extraction method. When PSPNet segmentation is used, the original red, green, and blue
(RGB) image is segmented, resulting in a segmented binary image. Then, for the extraction of color and texture
features, the image is cropped around the bounding box of the segmented region, obtained with the findContours
function of the OpenCV library. When segmentation is skipped, each image is converted into a binary image
using inverse binary thresholding (threshold value = 128) before the shape features are extracted.
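
The two preprocessing operations mentioned above, cropping to a bounding box and inverse binary thresholding, can be sketched without any library. This is an illustrative assumption-laden sketch rather than the OpenCV-based pipeline used in this research: the function names are hypothetical, images are plain lists of rows, and in the toy example the thresholded mask merely stands in for a segmentation mask.

```python
def inverse_binary_threshold(gray, threshold=128):
    """Pixels at or below the threshold become 1 (object), brighter pixels become 0."""
    return [[1 if v <= threshold else 0 for v in row] for row in gray]

def bounding_box(mask):
    """Return (top, left, bottom, right) of the nonzero region, end-exclusive, or None."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    if not rows:
        return None  # nothing was segmented
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1

def crop(image, box):
    """Keep only the pixels inside the bounding box."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]

# Toy grayscale image: a dark object on a bright background (hypothetical data).
gray = [[250, 250, 250, 250],
        [250,  40,  60, 250],
        [250,  50, 250, 250],
        [250, 250, 250, 250]]

mask = inverse_binary_threshold(gray)   # stands in for a segmentation mask here
box = bounding_box(mask)
cropped = crop(gray, box)
```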


Table 1. Comparison of the input image for feature extraction with PSPNet segmentation and without
segmentation

| Feature | Using PSPNet segmentation | Without segmentation |
|---|---|---|
| Color | Cropped bounding box of segmented RGB image | Original RGB image |
| Texture | Cropped bounding box of segmented grayscale image | Transformation from original RGB to grayscale image |
| Shape | Segmented binary image | Transformation from original RGB to binary image using inverse binary thresholding |


2.3.1. Color moments

Color is a visual feature that can be used to discriminate or recognize visual information. If the
color distribution of an image is interpreted as a probability distribution, then color moments can be used to
characterize the color distribution [28]. Three color moments (mean, standard deviation, and skewness) are
extracted for every image channel, so there are 9 numerical values extracted for an image in RGB color
space. The mean $E_i$ is the average of the pixel values, as shown in (2); the standard deviation $\sigma_i$ is
the variation of the pixel values, as shown in (3); and the skewness $s_i$ is the degree of asymmetry of the
color distribution in an image channel, as shown in (4). N is the total number of pixels in each channel and
$p_{ij}$ is the j-th pixel value in channel i [15].


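$$E_i = \frac{1}{N} \sum_{j=1}^{N} p_{ij} \qquad (2)$$

$$\sigma_i = \sqrt{\frac{1}{N} \sum_{j=1}^{N} \left(p_{ij} - E_i\right)^2} \qquad (3)$$

$$s_i = \sqrt[3]{\frac{1}{N} \sum_{j=1}^{N} \left(p_{ij} - E_i\right)^3} \qquad (4)$$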
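
The three moments in (2) to (4) can be sketched as follows. This is a minimal illustration with assumptions: the image is a flat list of (r, g, b) tuples, the function names are hypothetical, and a signed cube root is used for the skewness so that a negative third central moment stays real.

```python
import math

def color_moments(channel):
    """Mean, standard deviation and skewness of one channel (flat list of pixel values)."""
    n = len(channel)
    mean = sum(channel) / n
    std = (sum((p - mean) ** 2 for p in channel) / n) ** 0.5
    third = sum((p - mean) ** 3 for p in channel) / n
    # Signed cube root, since the third central moment can be negative.
    skew = math.copysign(abs(third) ** (1.0 / 3.0), third)
    return mean, std, skew

def rgb_color_moments(pixels):
    """pixels: list of (r, g, b) tuples -> 9 features (3 moments x 3 channels)."""
    features = []
    for i in range(3):
        features.extend(color_moments([p[i] for p in pixels]))
    return features

# Toy image of four pixels (hypothetical data).
pixels = [(0, 10, 5), (0, 10, 5), (0, 10, 5), (4, 10, 5)]
features = rgb_color_moments(pixels)
```

For the red channel of this toy image the mean is 1, the variance is 3, and the third central moment is 6, while the constant green and blue channels give zero standard deviation and zero skewness.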


2.3.2. Gray level co-occurrence matrix (GLCM)

GLCM is a method for extracting the texture features of an image. First, the co-occurrence matrix P is created.
P is a square matrix whose size is equal to the number of gray intensity levels of the image. Each element
$p_{ij}$ of the matrix is the number of occurrences (frequency) of two neighboring pixels in a specific
orientation, where the gray intensity value of the first pixel is equal to i and the gray intensity value of
the second pixel is equal to j [29]. The neighboring pixel is selected based on a specified spatial orientation.
For example, when the orientation is 0°, the neighbor of a pixel is the pixel on its right side. The resulting
GLCM is obtained by making P symmetric (adding P to its transpose) and then normalizing the value of each
element into [0, 1]. Several metrics can be calculated from the resulting GLCM, namely contrast,
angular second moment (ASM), energy, homogeneity, correlation, and dissimilarity. Detailed formulas for each
metric can be found in [30]. In this research, we construct the GLCM in various spatial orientations
(0°, 45°, 90°, and 135°).
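
The construction described above can be sketched as follows. This is a simplified illustration under assumptions: the neighbor distance is 1, normalization divides by the total count so that the entries sum to 1, the pixel offsets chosen for each angle follow one common convention, and only two of the listed metrics (contrast and ASM) are computed. The function names and the toy image are hypothetical.

```python
def glcm(image, dx, dy, levels):
    """Symmetric, normalized co-occurrence matrix for the pixel offset (dx, dy)."""
    h, w = len(image), len(image[0])
    counts = [[0.0] * levels for _ in range(levels)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                i, j = image[y][x], image[ny][nx]
                counts[i][j] += 1   # pair (i, j)
                counts[j][i] += 1   # transposed pair, which makes the matrix symmetric
    total = sum(map(sum, counts))
    return [[c / total for c in row] for row in counts]

def contrast(p):
    """Sum of p[i][j] * (i - j)^2 over all gray level pairs."""
    n = len(p)
    return sum(p[i][j] * (i - j) ** 2 for i in range(n) for j in range(n))

def asm(p):
    """Angular second moment: sum of squared matrix entries (energy is its square root)."""
    return sum(v * v for row in p for v in row)

# Offsets (dx, dy) at distance 1 for the four orientations; rows grow downwards (assumed convention).
OFFSETS = {0: (1, 0), 45: (1, -1), 90: (0, -1), 135: (-1, -1)}

# Toy 4x4 image with 4 gray levels (hypothetical data).
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 2, 2, 2],
         [2, 2, 3, 3]]

p0 = glcm(image, *OFFSETS[0], levels=4)
texture_features = {angle: (contrast(glcm(image, dx, dy, 4)), asm(glcm(image, dx, dy, 4)))
                    for angle, (dx, dy) in OFFSETS.items()}
```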


2.3.3. Shape

Shape is also a prominent feature for discriminating one image from another. This research extracts the shape
descriptors of an image from morphological features and Hu invariant moments. The morphological features
extracted are area, perimeter, major and minor axis, centroid-x, centroid-y, roundness, rectangularity,
eccentricity, elongation, dispersion I, dispersion IR, convexity, and solidity [19]. An illustration of these
morphological shape features can be seen in Figure 4.

In addition, Hu moments are also used to extract the shape descriptors of an image. The Hu moment is a
region-based method that uses second- and third-order central moments to construct 7 invariant moments.
The values of the invariant moment features are not affected when the image is translated, rotated, or scaled.
The details of the Hu moments can be found in [20].
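
To illustrate the invariance property, the sketch below computes only the first two of the seven Hu invariants, which are built from second-order normalized central moments, and evaluates them on a shape, a translated copy, and a transposed copy. The function name `hu_first_two` and the toy masks are assumptions for this example; the remaining five invariants, which also use third-order moments, are omitted.

```python
def hu_first_two(mask):
    """First two Hu invariants of a binary region (list of rows of 0/1)."""
    pts = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    m00 = float(len(pts))
    cx = sum(x for x, _ in pts) / m00
    cy = sum(y for _, y in pts) / m00

    def eta(p, q):
        # Normalized central moment: mu_pq / m00^(1 + (p + q) / 2)
        mu = sum((x - cx) ** p * (y - cy) ** q for x, y in pts)
        return mu / m00 ** (1 + (p + q) / 2.0)

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    return n20 + n02, (n20 - n02) ** 2 + 4 * n11 ** 2

# An L-shaped region and the same region shifted by one pixel in x and y (hypothetical data).
shape = [[1, 1, 1, 0],
         [1, 0, 0, 0],
         [0, 0, 0, 0]]
shifted = [[0, 0, 0, 0],
           [0, 1, 1, 1],
           [0, 1, 0, 0]]
transposed = [list(col) for col in zip(*shape)]   # swaps the x and y axes

h_shape = hu_first_two(shape)
h_shifted = hu_first_two(shifted)
h_transposed = hu_first_two(transposed)
```

All three calls return the same pair of values up to floating-point rounding, which is the behaviour that makes these moments useful as shape descriptors.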

Figure 4. Illustration of shape descriptors from morphological features


2.4. Classification 

Classification consists of training and testing. Training is used to build the classifier model, while
testing is used to evaluate the performance of the model, as illustrated in Figure 5. First, the dataset is
split using 10-fold cross validation. This research applies SVM as the classification algorithm.
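
The 10-fold split can be sketched as an index generator. This is a simplified illustration under assumptions: folds are contiguous and unshuffled, whereas in practice the samples would typically be shuffled or stratified by class first, and the function name `k_fold_indices` is hypothetical. The sample count of 2,527 is the Trashnet size given in Section 2.1.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for k contiguous, non-overlapping folds."""
    # Spread the remainder over the first folds so that sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(k_fold_indices(2527, k=10))
```

Each sample appears in exactly one test fold, so every image is used for testing once and for training nine times.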

Figure 5. Training and testing in classification

The SVM training algorithm works by finding the optimal hyperplane that maximizes the separation
between binary class data. The training data closest to the optimal hyperplane, which define the optimal
margin, are called support vectors [21]. When the data are not linearly separable, a non-linear mapping
$\phi(x_i)$ is applied to transform the original data into a higher dimension [31]. Given training pairs
$(x_i, y_i)$, $i = 1, \ldots, N$, where $x_i \in R^n$ is the input training data, $y_i$ is the target, and N is
the number of training data, SVM finds the solution by solving the optimization problem shown in (5), where w
is the weight vector and C is the error penalty. This optimization problem can be solved using the Lagrangian
formulation. The training data $x_i$ are normalized into [0, 1] before they are input to the SVM, and $y_i$
is set to -1 or 1 [32].


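$$\min_{w,\,b,\,\xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i \qquad (5)$$

$$\text{subject to} \quad y_i \left(w^T \phi(x_i) + b\right) \ge 1 - \xi_i, \quad \xi_i \ge 0$$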

In order to reduce the computational cost when working with nonlinear data, the kernel trick can be used
to substitute the dot product between transformed data tuples, as in (6). Some popular kernel functions can be
used, such as the radial basis function (RBF) and polynomial kernels, as shown in (7) and (8), respectively [33].


$$K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j) \qquad (6)$$

$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right) \qquad (7)$$

$$K(x_i, x_j) = \left(\gamma\,(x_i \cdot x_j) + r\right)^d \qquad (8)$$
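
The kernel trick in (6) can be checked numerically on a small case. The sketch below implements the RBF and polynomial kernels and compares a degree-2 polynomial kernel with the dot product of an explicit feature map. It is an illustration under assumptions: the explicit map `phi_quadratic` is valid only for 2D inputs with gamma = 1, r = 0, and d = 2, and the function names and sample vectors are hypothetical.

```python
import math

def dot(x, z):
    return sum(a * b for a, b in zip(x, z))

def rbf_kernel(x, z, gamma):
    """RBF kernel: exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def polynomial_kernel(x, z, gamma, r, d):
    """Polynomial kernel: (gamma * <x, z> + r)^d."""
    return (gamma * dot(x, z) + r) ** d

def phi_quadratic(x):
    """Explicit feature map matching the 2D polynomial kernel with gamma=1, r=0, d=2."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

x, z = (1.0, 2.0), (3.0, 4.0)
via_kernel = polynomial_kernel(x, z, gamma=1.0, r=0.0, d=2)   # needs only the 2D dot product
via_mapping = dot(phi_quadratic(x), phi_quadratic(z))         # explicit 3D mapping, as in (6)
```

Both routes give the same value, but the kernel route never builds the higher-dimensional vectors, which is where the computational saving comes from.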


Once the optimization problem is solved, the optimal hyperplane and the support vectors are obtained.
Then, the output $out(x_t)$ for a new test sample $x_t$ can be determined using (9), where $x_i$ are the
support vectors, $y_i$ is the class label of the i-th support vector, l is the number of support vectors,
$\alpha_i$ are the Lagrange multipliers, and b is the bias [32]. This research applies the one-versus-rest
strategy to handle the multiclass classification problem, because the Trashnet dataset consists of 6 classes.


$$out(x_t) = \operatorname{sgn}\left(\sum_{i=1}^{l} y_i \alpha_i K(x_i, x_t) + b\right) \qquad (9)$$
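
The decision rule in (9) and the one-versus-rest strategy can be sketched as follows. This is an illustration under assumptions, not the trained classifier from this research: the two toy models are hand-set rather than learned, a linear kernel is used for readability, sgn(0) is mapped to +1, and the multiclass rule (choose the class whose binary model yields the largest decision value) is a common convention that the text does not spell out.

```python
def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))

def decision_value(x_t, support_vectors, labels, alphas, b, kernel):
    """Argument of sgn(.) in (9): sum_i y_i * alpha_i * K(x_i, x_t) + b."""
    return sum(y * a * kernel(x_i, x_t)
               for x_i, y, a in zip(support_vectors, labels, alphas)) + b

def predict_binary(x_t, model, kernel):
    # sgn(0) is mapped to +1 here, a convention chosen for this sketch.
    return 1 if decision_value(x_t, *model, kernel) >= 0 else -1

def predict_one_vs_rest(x_t, models, kernel):
    """Pick the class whose binary model gives the largest decision value."""
    return max(models, key=lambda name: decision_value(x_t, *models[name], kernel))

# Hypothetical hand-set models: (support vectors, labels, alphas, bias). Not trained values.
models = {
    "paper": ([(1.0, 0.0)], [1], [1.0], -0.5),
    "glass": ([(0.0, 1.0)], [1], [1.0], -0.5),
}
x_t = (0.2, 0.9)
predicted_class = predict_one_vs_rest(x_t, models, linear_kernel)
```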


2.5. Evaluation 

Evaluation is performed to assess the resulting classification model. In this research, the
classification model is evaluated in terms of accuracy. Accuracy is the ratio between the number of correctly
predicted data and the total number of data [34].
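
Accuracy as defined above can be computed as in this short sketch; the function name and the toy label lists are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

y_true = ["glass", "paper", "metal", "trash", "plastic"]
y_pred = ["glass", "paper", "metal", "paper", "plastic"]
acc = accuracy(y_true, y_pred)   # 4 of 5 labels match
```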


3. RESULTS AND ANALYSIS 

This research is performed in two main scenarios. The first scenario uses PSPNet
segmentation, while the second scenario skips the segmentation process. In each scenario, single features and
combinations of color (color moments), texture (GLCM), and shape (morphological features
and Hu invariant moments) features are tested to search for the image features that best describe the trash
images and give the best classification results. The GLCM feature extraction method is performed in various
spatial orientations: 0°, 45°, 90°, and 135°. Then, in the classification, the SVM training algorithm is run
with combinations of parameters, namely the kernel function (RBF and polynomial) and the error penalty C (1 or
100). Therefore, for each feature extraction setting in a scenario, classification with SVM is performed four
times using different combinations of kernel function and error penalty C. In the last section, the results of
the first scenario and the second scenario are compared.


3.1. The first scenario 

In this scenario, segmentation is performed in the first stage using PSPNet. In order to obtain the
best PSPNet model, this research tries combinations of hyperparameters: learning rate (0.001, 0.0001, and
0.00001) and batch size (5 and 10). An experiment for each combination of hyperparameters is run for 50
epochs.
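
The hyperparameter grid described above can be enumerated as in the sketch below. It only lists the configurations and does not train anything. The dictionary keys and the `select_best` helper are assumptions for illustration, and `evaluate` is a placeholder for whatever scoring function is used, for example the Dice coefficient on the testing data.

```python
from itertools import product

learning_rates = (0.001, 0.0001, 0.00001)
batch_sizes = (5, 10)
EPOCHS = 50

# One configuration per (learning rate, batch size) pair: 3 x 2 = 6 models.
configs = [
    {"learning_rate": lr, "batch_size": bs, "epochs": EPOCHS}
    for lr, bs in product(learning_rates, batch_sizes)
]

def select_best(configs, evaluate):
    """Return the configuration with the highest score from `evaluate` (a placeholder callable)."""
    return max(configs, key=evaluate)
```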

Figure 6(a) shows that the learning rate of 0.0001 gives better results than the other
values. A possible explanation is that when the learning rate is too small, the network learns very
slowly, so the result is lower. Conversely, when the learning rate is too high, the learning process
may diverge, so the network fails to achieve the best result. Figure 6(b) shows that the batch size of 5
reaches better performance than the batch size of 10. A possible explanation is that, in this case, the
stochastic nature of using smaller mini-batches helps the training find a better solution. Therefore, the
segmentation in the rest of the experiments is performed using the best segmentation model, trained with this
combination of hyperparameters. The results of segmentation using PSPNet for the sample images in Figure 2 can
be seen in Figure 7.

Figure 6. Experiment results of PSPNet segmentation on the testing dataset: (a) the average result over all
batch values for each learning rate and (b) the average result over all learning rate values for each batch value

Figure 7. Sample results of segmentation using PSPNet


The experimental results of this scenario for the various features, with the best accuracy for each SVM
kernel (RBF and polynomial), can be seen in Figure 8. The RBF kernel is better than the polynomial
kernel in most of the experiments, but when more features are combined, the polynomial kernel is
better than the RBF kernel. The highest accuracy in this scenario, 76.49%, is achieved by the polynomial kernel
with C = 1 when using the combination of color, GLCM 135°, and shape features. The highest accuracy of the RBF
kernel in this scenario is 74.55%, obtained with C = 100 and the same combination of color, GLCM 135°,
and shape features. Therefore, it can be concluded that when segmentation is used, the classification
performance increases as more features are combined. Combining more features gives a more
representative feature set for an image, and therefore the classification accuracy increases. However, the most
important feature is the color feature. When the color feature is removed, the accuracy of the classifier
decreases.


3.2. The second scenario 

The second scenario is performed without segmentation in the preprocessing. The experimental results
of this scenario for the various features, with the best accuracy for each SVM kernel (RBF and polynomial), can
be seen in Figure 9. The RBF kernel is better than the polynomial kernel in most of the experiments. However,
the highest accuracy in this scenario, 74.83%, is achieved by the polynomial kernel with C = 100 when using
the combination of color and GLCM 90° features. The best result of the RBF kernel in this scenario is 74.55%,
obtained with C = 100 and the combination of color and GLCM 135° features. Therefore, it can be concluded that
when segmentation is not used, the combination of features that best describes the trash images is
the combination of color and GLCM. When the shape features are added, the classification performance
decreases. To extract shape features in this scenario, a conventional thresholding operation is applied to
transform an RGB image into a binary image; therefore, the resulting binary image is not a good enough source
for extracting shape features.


3.3. Comparison of the first scenario and the second scenario 

Figure 10 shows the comparison between the first scenario and the second scenario. Based on
Figure 10, the first scenario (using PSPNet segmentation) is better than the second scenario
(without segmentation) in most of the experiments. The second scenario outperforms the first scenario in only 5
of the 19 experiments. Therefore, it can be concluded that applying PSPNet segmentation generally provides a
better source for feature extraction, especially for color and shape features, and hence increases the
classification performance.

Figure 8. Results of experiments in the first scenario

Figure 9. Results of experiments in the second scenario

Figure 10. Comparison between the first scenario (using segmentation) and the second scenario (without
segmentation)


It is also observed that the most important feature in this problem is the color feature. When a single
feature is used, the color feature provides the highest result compared with the GLCM (texture) and shape
features, in both the first and the second scenario. However, the accuracy increases if additional features are
introduced. In the first scenario, better results are achieved when all feature types are combined, while in the
second scenario better results are achieved when only color and texture features are used. Therefore, it can be
concluded that when segmentation is applied using PSPNet, the segmented binary image provides a better source
for shape feature extraction. Conversely, when the binary image is obtained only by inverse binary
thresholding, the result is not good enough for shape feature extraction. Hence, the classification accuracy
decreases when the shape feature is added in the second scenario. Of all the parameter combinations tested in
this research, the highest accuracy of 76.49% is achieved when using PSPNet segmentation and the combination of
all feature types (color, texture, and shape).

The results of this research show that combinations of features increase the performance
of the resulting model compared with individual features, but they are still not sufficient to uniquely
characterize each class of solid waste image. More representative additional features are still required to
improve the performance of the classifier. The tuning of the classification algorithm's parameters also needs
to be explored to obtain better classification results.


4. CONCLUSION 

In this research, we apply PSPNet for segmentation and combinations of image feature extraction (color,
texture, and shape) to classify solid waste images. As a comparison, to see the effect of PSPNet
segmentation, we also perform experiments without segmentation. Based on the experimental results, it
can be concluded that applying segmentation generally provides a better source for feature extraction,
especially for color and shape features, and hence increases the classification accuracy. It is also observed
that the most important feature in this problem is the color feature, whether or not segmentation is applied.
However, the accuracy of the classifier increases if additional features are introduced. When segmentation is
not used, better results are achieved when using only color and texture features, while when segmentation is
applied, the highest accuracy of 76.49% is achieved when all feature types are combined.